Session F-4

F-4: Caching

Conference
8:30 AM — 10:00 AM PDT
Local
May 22 Wed, 11:30 AM — 1:00 PM EDT
Location
Regency F

On Pipelined GCN with Communication-Efficient Sampling and Inclusion-Aware Caching

Shulin Wang, Qiang Yu and Xiong Wang (Huazhong University of Science and Technology, China); Yuqing Li (Wuhan University, China); Hai Jin (Huazhong University of Science and Technology, China)

Graph convolutional networks (GCNs) have achieved enormous success in learning structural information from unstructured data. As graphs become increasingly large, distributed training for GCNs is severely prolonged by frequent cross-worker communication. Existing efforts to improve training efficiency often come at the expense of GCN performance, while the communication overhead persists. In this paper, we propose PSC-GCN, a holistic pipelined framework for distributed GCN training with communication-efficient sampling and inclusion-aware caching, to address the communication bottleneck while ensuring satisfactory model performance. Specifically, we devise an asynchronous pre-fetching scheme that retrieves stale statistics (features, embeddings, gradients) of boundary nodes in advance, so that embedding aggregation and model update are pipelined with statistics transmission. To reduce communication volume and the staleness effect, we introduce a variance-reduction-based sampling policy, which prioritizes inner nodes over boundary ones to reduce the access frequency to remote neighbors, thus mitigating cross-worker statistics exchange. Complementing graph sampling, a feature caching module is co-designed to buffer hot nodes with high inclusion probability, ensuring that frequently sampled nodes are available in local memory. Extensive evaluations on real-world datasets show the superiority of PSC-GCN over state-of-the-art methods, reducing training time by 72%-80% without sacrificing model accuracy.
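The inclusion-aware caching idea above can be sketched as follows: keep locally the nodes most likely to be sampled, so their features need not be fetched from remote workers. This is a minimal illustration, not PSC-GCN's actual module; the function name, the probability table, and the capacity parameter are all hypothetical.

```python
import heapq

def build_feature_cache(inclusion_prob, capacity):
    """Keep the `capacity` nodes with the highest estimated inclusion
    probability, so their features are served from local memory instead
    of remote workers. `inclusion_prob` maps node id -> estimated
    probability of being sampled (a stand-in for the statistics a
    sampler like PSC-GCN's would maintain)."""
    hot = heapq.nlargest(capacity, inclusion_prob, key=inclusion_prob.get)
    return set(hot)

# Toy example: four nodes with illustrative inclusion probabilities.
probs = {0: 0.9, 1: 0.05, 2: 0.6, 3: 0.2}
cache = build_feature_cache(probs, capacity=2)
assert cache == {0, 2}  # the two hottest nodes are buffered locally
```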
Speaker Shulin Wang



A Randomized Caching Algorithm for Distributed Data Access

Tianyu Zuo, Xueyan Tang and Bu Sung Lee (Nanyang Technological University, Singapore)

In this paper, we study an online cost optimization problem for distributed data access. The goal is to dynamically create and delete data copies in a multi-server distributed system over time, so as to minimize the total storage and network cost of serving access requests. We propose an online algorithm with randomized storage periods for data copies in the servers, and derive an optimal probability density function of storage periods that makes the algorithm achieve a competitive ratio of \(1+\frac{\sqrt{2}}{2}\). An example shows that this competitive ratio is not only tight but also asymptotic. Experimental evaluations using real data access traces demonstrate that our algorithm outperforms the best known deterministic algorithm.
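The randomized-storage-period idea can be illustrated with a toy single-server simulation: after each request, keep the copy for a randomly drawn period; a request arriving after expiry pays a network fetch. The cost constants and the truncated-exponential density below are illustrative stand-ins, not the paper's derived optimal pdf.

```python
import random

# Illustrative cost parameters (not from the paper): storage cost per
# unit time and network cost per remote fetch.
STORAGE_COST_RATE = 1.0
NETWORK_COST = 1.0

def sample_storage_period():
    # Stand-in density: the paper derives an optimal pdf over storage
    # periods; a truncated exponential is used here only to illustrate
    # the randomized-period mechanism.
    return min(random.expovariate(1.0), NETWORK_COST / STORAGE_COST_RATE)

def simulate(request_times):
    """Serve a sequence of access times at one server, keeping the
    local copy for a freshly drawn random period after each request."""
    cost, expire = 0.0, float("-inf")
    for t in sorted(request_times):
        if t > expire:
            cost += NETWORK_COST          # copy absent: fetch over network
        new_expire = t + sample_storage_period()
        # Pay storage only for the newly added storage interval.
        cost += STORAGE_COST_RATE * max(0.0, new_expire - max(t, expire))
        expire = max(expire, new_expire)
    return cost
```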

CDCache: Space-Efficient Flash Caching via Compression-before-Deduplication

Hengying Xiao and Jingwei Li (University of Electronic Science and Technology of China, China); Yanjing Ren (The Chinese University of Hong Kong, Hong Kong); Ruijin Wang and Xiaosong Zhang (University of Electronic Science and Technology of China, China)

Large-scale storage systems boost I/O performance via flash caching, but the underlying storage medium incurs significant costs and exhibits low endurance. Previous studies adopt compression-after-deduplication to avoid writing redundant content into the flash cache, so as to address the cost and endurance issues. However, deduplication and compression have conflicting preferable cases, and compression-after-deduplication essentially compromises the space-saving benefits of either deduplication or compression. To preserve the benefits of both approaches simultaneously, we explore compression-before-deduplication, which applies compression to eliminate byte-level redundancies across data blocks, followed by deduplication to write only a single copy of duplicate compressed blocks into the flash cache. We present CDCache, a space-efficient flash caching system that realizes compression-before-deduplication. CDCache dynamically adjusts the compression range of data blocks to preserve the effectiveness of deduplication on the compressed blocks. It also builds on various design techniques to approximately estimate duplicate data blocks and efficiently manage compressed blocks. Trace-driven experiments show that CDCache improves the read hit ratio and the write reduction ratio of a previous compression-after-deduplication approach by up to 1.3× and 1.6×, respectively, with only a small memory overhead for index management.
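The compression-before-deduplication order can be sketched in a few lines: compress each block first, then fingerprint the compressed bytes and write to flash only when the fingerprint is new. This is a toy illustration of the ordering only; the class name and structure are hypothetical and omit CDCache's compression-range adjustment and approximate duplicate estimation.

```python
import hashlib
import zlib

class CDCacheSketch:
    """Toy flash-cache writer illustrating compression-before-
    deduplication: blocks are compressed first, then identical
    compressed blocks are deduplicated by fingerprint."""

    def __init__(self):
        self.index = {}   # fingerprint -> slot in the simulated flash log
        self.flash = []   # simulated flash log of compressed blocks

    def write(self, block: bytes) -> int:
        compressed = zlib.compress(block)           # compress first
        fp = hashlib.sha256(compressed).digest()    # then fingerprint
        if fp in self.index:                        # duplicate: skip flash write
            return self.index[fp]
        self.flash.append(compressed)               # unique: write to flash
        self.index[fp] = len(self.flash) - 1
        return self.index[fp]

cache = CDCacheSketch()
a = cache.write(b"hello world" * 100)
b = cache.write(b"hello world" * 100)  # deduplicated against the first write
assert a == b and len(cache.flash) == 1
```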
Speaker Hengying Xiao (University of Electronic Science and Technology of China)



Dependency-Aware Online Caching

Julien Dallot (TU Berlin, Germany); Amirmehdi Jafari Fesharaki (Sharif University of Technology, Iran); Maciej Pacut and Stefan Schmid (TU Berlin, Germany)

We consider a variant of the online caching problem where the items exhibit dependencies among each other: an item can reside in the cache only if all the items it depends on are also in the cache. The dependency relations can form any directed acyclic graph. These requirements arise in systems such as CacheFlow [SOSR 2016] that cache forwarding rules for packet classification in IP-based communication networks. First, we present an optimal randomized online caching algorithm that accounts for dependencies among the items. Our randomized algorithm is O(log k)-competitive, where k is the size of the cache, meaning that its cost is never more than O(log k) times that of an optimal algorithm that knows the future input sequence. Second, we consider the bypassing model, where requests can be served at a fixed price without fetching the item and its dependencies into the cache, a variant of caching with dependencies introduced by Bienkowski et al. at SPAA 2017. For this setting, we give an O(sqrt(k log k))-competitive algorithm, which significantly improves the best known competitiveness. A small case study shows that our algorithm incurs on average 2x lower cost.
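The cache invariant described above (an item resides in the cache only if all its dependencies do) can be made concrete with a dependency-closure check. The DAG and rule names below are hypothetical, loosely modeled on the forwarding-rule use case; this sketches only the invariant, not the competitive algorithm itself.

```python
# Hypothetical dependency DAG: item -> items it depends on.
deps = {"rule_a": [], "rule_b": ["rule_a"], "rule_c": ["rule_a"],
        "rule_d": ["rule_b", "rule_c"]}

def closure(item):
    """All items that must be in the cache alongside `item`
    (the item itself plus its transitive dependencies)."""
    seen, stack = set(), [item]
    while stack:
        cur = stack.pop()
        if cur not in seen:
            seen.add(cur)
            stack.extend(deps[cur])
    return seen

def evictable(cache, item):
    """An item may leave the cache only if no other cached item
    directly depends on it, preserving the invariant."""
    return all(item not in deps[other] for other in cache if other != item)

cache = closure("rule_d")              # caching rule_d pulls in a, b, c
assert not evictable(cache, "rule_a")  # rule_b and rule_c still need it
assert evictable(cache, "rule_d")      # nothing depends on rule_d
```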

Session Chair

Stratis Ioannidis (Northeastern University, USA)

Session F-5

F-5: Remote Direct Memory Access (RDMA)

Conference
10:30 AM — 12:00 PM PDT
Local
May 22 Wed, 1:30 PM — 3:00 PM EDT
Location
Regency F

ZETA: Transparent Zero-Trust Security Add-on for RDMA

Hyunseok Chang and Sarit Mukherjee (Nokia Bell Labs, USA)

While the fast adoption of RDMA in data centers has been primarily driven by its performance benefits, more and more attention is being paid to its security implications, especially with mounting security risks associated with lateral communication within data centers. However, since RDMA is implemented as a fixed function of the NIC, it is challenging to incorporate new security features into RDMA. In this paper, we propose ZETA, a zero-trust security add-on for RoCEv2, which enables network-independent, fine-grained zero-trust security control for RDMA. It does not require any change in RDMA's ASIC implementation or application-level interfaces. To this end, ZETA leverages the versatility of modern SmartNICs to perform zero-trust policy control on RDMA packets within a SmartNIC in a cryptographically secure fashion. From a prototype implementation and evaluation based on real-world applications, we show that, while ZETA's cryptographic verification introduces 1.5ms of session startup latency, the end-to-end application performance overhead is marginal (e.g., less than 1% throughput and 5% latency penalty).

Host-driven In-Network Aggregation on RDMA

Yulong Li and Wenxin Li (Tianjin University, China); Yinan Yao (Tianjin University, China); Yuxuan Du and Keqiu Li (Tianjin University, China)

Large-scale datacenter networks increasingly use in-network aggregation (INA) and remote direct memory access (RDMA) to accelerate deep neural network (DNN) training. However, existing research trends suggest that these two techniques are on an inevitable collision course. To resolve this conflict, we present FreeINA, a host-driven in-network aggregation scheme that provides RDMA reliable connection (RC) semantics for multi-tenant learning settings. FreeINA relies on dual transmission paths to support RC compatibility, with one path for INA and another for aggregation on the end-host parameter server. By dynamically controlling these two paths, FreeINA leaves traditional in-server aggregation unaffected while ensuring INA's reliability without modifying RDMA network interface cards (RNICs). To support multi-tenant learning, FreeINA employs all-reduce-level memory allocation, which captures the well-known "on and off" DNN training pattern and thus improves switch memory efficiency. We have implemented a FreeINA prototype using a P4-programmable switch and commercial RNICs, and evaluated it extensively on a 100Gbps testbed. The results show that, compared to the state-of-the-art solution ATP, FreeINA improves the single-job training speedup ratio by 1.20x, while improving aggregation throughput by 2.65x in multi-job scenarios.

INSERT: In-Network Stateful End-to-End RDMA Telemetry

Hyunseok Chang (Nokia Bell Labs, USA); Walid A. Hanafy (University of Massachusetts Amherst, USA); Sarit Mukherjee and Limin Wang (Nokia Bell Labs, USA)

Remote Direct Memory Access (RDMA) has been widely adopted in modern data centers thanks to its high-throughput, low-latency data transfer and reduced CPU overhead. However, traditional network-flow-based monitoring is poor at interpreting RDMA-based communication and hence inadequate for gaining insights. In this paper, we present INSERT, an end-to-end RDMA telemetry system that enables seamless visibility into RDMA-based communication from the network layer all the way to the application layer. To this end, INSERT combines (i) eBPF-based transparent RDMA tracing on end-hosts and (ii) stateful RDMA network telemetry on a programmable data plane. We implement the network telemetry on programmable SmartNICs, where we address practical challenges of maintaining fine-grained state on massively parallel packet processing pipelines. We demonstrate that INSERT performs reasonably accurate telemetry at line rate for different types of RDMA traffic, even in the presence of out-of-order packets, and we showcase two practical use cases that can benefit from INSERT.

RB\(^2\): Narrow the Gap between RDMA Abstraction and Performance via a Middle Layer

Haifeng Sun, Yixuan Tan, Yongtong Wu, Jiaqi Zhu and Qun Huang (Peking University, China); Xin Yao and Gong Zhang (Huawei Technologies Co., Ltd., China)

Although the native RDMA interface allows for high throughput and low latency, its low-level abstraction raises significant programming challenges. Consequently, numerous systems encapsulate the RDMA interface into more user-friendly high-level abstractions such as Socket, MPI, and RPC. However, this ease of development often incurs considerable performance degradation. To address this trade-off, this paper introduces RB\(^2\), a high-performance RDMA-based Distributed Ring Buffer (DRB). RB\(^2\) serves as a middle layer that effectively conceals the low-level details of the RDMA interface while also facilitating extension to other high-level abstractions.

Nonetheless, it is non-trivial for DRBs to preserve RDMA's performance. We optimize RB\(^2\) in three aspects. First, we perform micro-benchmarks to identify pointer synchronization methods that are seemingly counter-intuitive but offer the best performance. Second, we propose an adaptive batching mechanism to alleviate the limitations of conventional fixed batching. Finally, we build an efficient memory subsystem using various optimization techniques. RB\(^2\) outperforms state-of-the-art designs, achieving 2.5× to 7.5× the throughput while maintaining comparable tail latency for small messages.

Session Chair

Sangheon Pack (Korea University, Korea (South))

Session F-6

F-6: Video Delivery and Analytics

Conference
1:30 PM — 3:00 PM PDT
Local
May 22 Wed, 4:30 PM — 6:00 PM EDT
Location
Regency F

AdaStreamer: Machine-Centric High-Accuracy Multi-Video Analytics with Adaptive Neural Codecs

Andong Zhu, Sheng Zhang, Ke Cheng, Xiaohang Shi, Zhuzhong Qian and Sanglu Lu (Nanjing University, China)

A growing volume of video captured by widely deployed cameras is analyzed by computer-vision Deep Neural Networks (DNNs) on servers rather than streamed to humans. Unfortunately, conventional codecs (e.g., H.26x and MPEG-x), originally designed for video streaming, lack content-aware feature extraction and hinder machine-centric video analytics, making it difficult to achieve the required high accuracy with tolerable delay. Neural codecs (e.g., autoencoders) now deliver impressive compression performance and have been widely advocated for video streaming. While autoencoders show transformative potential, their application in video analytics is hampered by low accuracy in detecting small objects in high-resolution videos and by the serious challenges posed by multi-video streaming. To this end, we propose AdaStreamer, which uses adaptive neural codecs to enable truly machine-centric, high-accuracy multi-video analytics. We also investigate how to achieve optimal accuracy under delay constraints via careful scheduling of Compression Ratios (CRs, the ratio of compressed size to original data size) and bandwidth allocation, and propose a Markov-based Adaptive Compression and Bandwidth Allocation algorithm (MACBA). We have developed a prototype of AdaStreamer, based on which extensive experiments verify its accuracy improvement (up to 15%) over state-of-the-art coding and streaming solutions.

AggDeliv: Aggregating Multiple Wireless Links for Efficient Mobile Live Video Delivery

Jinlong E (Renmin University of China, China); Lin He and Zongyi Zhao (Tsinghua University, China); Yachen Wang, Gonglong Chen and Wei Chen (Tencent, China)

Mobile live-streaming applications with stringent latency and bandwidth requirements have gained tremendous attention in recent years. Faced with bandwidth insufficiency and congestion instability on wireless uplinks, multi-access networking provides opportunities to achieve fast and robust connectivity. However, state-of-the-art multi-path transmission solutions lack adaptivity to the heterogeneous and dynamic nature of wireless networks. Meanwhile, the indispensable video coding and transformation introduce extra latency and make video delivery vulnerable to network throughput fluctuation. This paper presents AggDeliv, a framework that provides efficient and robust multi-path transmission for mobile live video delivery. The key idea is to relate multi-path packet scheduling to congestion control optimization over diverse wireless links and adapt it to mobile video characteristics. This is achieved by probabilistic packet allocation based on links' congestion windows, wireless-oriented delay- and loss-aware congestion control, as well as lightweight video frame coding and network-adaptive frame-packet transformation. Real-world evaluations demonstrate that our framework significantly outperforms state-of-the-art solutions in aggregate goodput and streaming video bitrate.
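One simplified reading of the probabilistic packet allocation above: send each packet over a link chosen with probability proportional to that link's congestion window, so faster links carry proportionally more traffic. The link names and the proportional rule are illustrative assumptions, not AggDeliv's exact scheduler.

```python
import random

def allocate_packet(cwnds):
    """Pick a wireless link for the next packet with probability
    proportional to its congestion window (`cwnds`: link -> window).
    A minimal sketch of congestion-window-weighted allocation."""
    total = sum(cwnds.values())
    r = random.uniform(0, total)
    acc = 0.0
    for link, w in cwnds.items():
        acc += w
        if r <= acc:
            return link
    return link  # guard against floating-point edge cases

# Toy example: a link with a 3x larger window should carry ~3x the packets.
links = {"wifi": 30, "lte": 10}
counts = {"wifi": 0, "lte": 0}
for _ in range(10000):
    counts[allocate_packet(links)] += 1
assert counts["wifi"] > counts["lte"]
```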
Speaker Jinlong E (Renmin University of China)

He is currently a lecturer at Renmin University of China. His research interests include cloud/edge computing, mobile streaming media, and AIoT.


BiSwift: Bandwidth Orchestrator for Multi-Stream Video Analytics on Edge

Lin Sun (Nanjing University, China); Weijun Wang (Tsinghua University, China); Tingting Yuan (Georg-August-University of Göttingen, Germany); Liang Mi (Nanjing University, China); Haipeng Dai (Nanjing University, China & State Key Laboratory for Novel Software Technology, China); Yunxin Liu (Tsinghua University, China); Xiaoming Fu (University of Goettingen, Germany)

High-definition (HD) cameras for surveillance and road traffic have experienced tremendous growth, demanding intensive computation resources for real-time analytics. Recently, offloading frames from the front-end device to the back-end edge server has shown great promise. In multi-stream competitive environments, efficient bandwidth management and proper scheduling are crucial to ensure both high inference accuracy and high throughput. To achieve this goal, we propose BiSwift, a bi-level framework that scales concurrent real-time video analytics via a novel adaptive hybrid codec integrated with multi-level pipelines, and a global bandwidth controller for multiple video streams. The lower-level front-back-end collaborative mechanism (the adaptive hybrid codec) locally optimizes accuracy and accelerates end-to-end video analytics for a single stream. The upper-level scheduler targets accuracy fairness among multiple streams via the global bandwidth controller. Our evaluation shows that BiSwift can perform real-time object detection on 9 streams with an edge device equipped only with an NVIDIA RTX3070 (8G) GPU. BiSwift improves accuracy by 10%~21% and delivers 1.2~9x the throughput of state-of-the-art video analytics pipelines.
Speaker Weijun Wang (Tsinghua University)

Weijun Wang is currently a Postdoctoral Fellow in the AIoT Group at the Institute for AI Industry Research, Tsinghua University, China. His research area is edge LLMs, especially efficiently serving large vision models on the edge. Weijun Wang received dual Ph.D. degrees from Nanjing University, China, and the University of Göttingen, Germany. He was a researcher at the University of Göttingen from 2022 to 2023.


Crucio: End-to-End Coordinated Spatio-Temporal Redundancy Elimination for Fast Video Analytics

Andong Zhu, Sheng Zhang, Xiaohang Shi, Ke Cheng, Hesheng Sun and Sanglu Lu (Nanjing University, China)

A Video Analytics Pipeline (VAP) usually relies on traditional codecs to stream video content from clients to servers. However, such analytics-agnostic codecs preserve considerable pixels not relevant to achieving high analytics accuracy, incurring a large end-to-end delay. Despite significant prior efforts, existing approaches fall short of complete redundancy elimination. Achieving this goal is extremely challenging, and a naive design without coordination can see the benefits of redundancy elimination counterbalanced by the intolerable delays it introduces. We present Crucio, an end-to-end coordinated spatio-temporal redundancy elimination system for edge video analytics. Crucio leverages reshaped asymmetric autoencoders for end-to-end frame filtering (temporal) and coordinated intra-frame (spatial) and inter-frame (temporal) compression. Furthermore, Crucio decodes the compressed key frames all in one go and supports adaptive VAP batch sizes for delay optimization. Extensive evaluations reveal significant end-to-end delay reductions (at least 31% under an accuracy target of 0.9) for Crucio compared to state-of-the-art VAP redundancy elimination methods (e.g., DDS, Reducto, and STAC).

Session Chair

Sabur Baidya (University of Louisville, USA)

Session F-7

F-7: Data Center Networking

Conference
3:30 PM — 5:00 PM PDT
Local
May 22 Wed, 6:30 PM — 8:00 PM EDT
Location
Regency F

RateMP: Optimizing Bandwidth Utilization with High Burst Tolerance in Data Center Networks

Jiangping Han, Kaiping Xue and Wentao Wang (University of Science and Technology of China, China); Ruidong Li (Kanazawa University, Japan); Qibin Sun and Jun Lu (University of Science and Technology of China, China)

Load balancing in data center networks (DCNs) is a crucial and complex undertaking. Multi-path TCP (MPTCP) has been proposed as a cost-effective solution that aims to distribute workloads and improve network resource utilization. However, it can escalate buffer occupancy and undermine burst tolerance, particularly in scenarios involving incast short flows.
To address these limitations, we propose a novel multi-path congestion control algorithm, RateMP, to optimize bandwidth utilization efficiency while ensuring burst tolerance in DCNs. RateMP employs a hybrid window and rate control loop with coupled gradient projection adjustment, enabling fast and fine-grained bandwidth allocation and accelerating convergence. Additionally, RateMP removes the cwnd limitation with under-rate pacing to protect incast and bursty flows.
We prove that RateMP is Lyapunov stable and asymptotically stable, and show its improvements through a kernel-based implementation and extended large-scale simulations. RateMP keeps bandwidth utilization high, cuts RTT by 2x, and reduces flow completion times (FCT) by 45% in incast scenarios compared to existing algorithms.
Speaker Jiangping Han (University of Science and Technology of China)

Jiangping Han received her bachelor's and doctor's degrees from the Department of Electronic Engineering and Information Science (EEIS), USTC, in 2016 and 2021, respectively. From Nov. 2019 to Oct. 2021, she was a visiting scholar with the School of Computing, Informatics, and Decision Systems Engineering, Arizona State University. She was then a Post-Doctoral researcher with the School of Cyber Science and Technology, USTC, where she is currently an associate researcher. Her research interests include future Internet architecture design and transmission optimization.


Rearchitecting Datacenter Networks: A New Paradigm with Optical Core and Optical Edge

Sushovan Das, Arlei Silva and T. S. Eugene Ng (Rice University, USA)

All-optical circuit-switching (OCS) technology is key to designing energy-efficient and high-performance datacenter network (DCN) architectures for the future. However, existing round-robin-based OCS cores perform poorly under realistic workloads with high traffic skewness and a high volume of inter-rack traffic. To address this issue, we propose OSSV, a novel DCN architecture combining an OCS-based core (between ToR switches) with an OCS-based reconfigurable edge (between servers and ToR switches). On one hand, the OCS core is traffic-agnostic and realizes reconfigurably non-blocking ToR-level connectivity. On the other hand, the OCS-based edge reconfigures itself to reshape incoming traffic so as to jointly minimize traffic skewness and inter-rack traffic volume. Our novel optimization framework obtains the right balance between these intertwined objectives. Extensive simulations and testbed evaluation show that OSSV achieves high performance under diverse DCN traffic while consuming low power and incurring low cost.

BiCC: Bilateral Congestion Control in Cross-datacenter RDMA Networks

Zirui Wan, Jiao Zhang and Mingxuan Yu (Beijing University of Posts and Telecommunications, China); Junwei Liu and Jun Yao (Chinamobile Cloud Centre, China); Xinghua Zhao (China Mobile (Suzhou) Software Technology Co., Ltd, China); Tao Huang (Beijing University of Posts and Telecommunications, China)

With the development of network-intensive applications like machine learning and cloud storage, there are two growing trends: (i) RDMA has been widely deployed to enhance underlying high-speed networks; (ii) applications are deployed on geographically distributed datacenters to meet customer demands (e.g., low access latency to services or regular data backups). To fully utilize the benefits of RDMA, it is desirable to support long-haul RDMA transport for cross-datacenter applications. Unlike common intra-datacenter communication, the mix of long-haul and intra-datacenter traffic complicates the congestion state, and the considerably long control loop makes it more severe. We revisit existing congestion control methods and find they are insufficient to address this hybrid traffic congestion.

In this paper, we propose Bilateral Congestion Control (BiCC), a novel solution relying on DCI switches on both sides to bilaterally alleviate hybrid traffic congestion in the sender-side and receiver-side datacenters, while serving as a building block for existing host-driven methods. We implement BiCC on commodity P4-based switches and conduct evaluations using both testbed experiments and NS3 simulations. The extensive evaluation results show that BiCC ensures fast congestion avoidance: it reduces the average FCT for intra-datacenter and inter-datacenter traffic by up to 53% and 51%, respectively, in large-scale simulations.
Speaker Zirui Wan

Zirui Wan is a fourth-year Ph.D. student at Beijing University of Posts and Telecommunications, advised by Professor Jiao Zhang, where he received his bachelor's degree in 2020. His research interests are transport protocols in different networks, including datacenter networks and intra-host networks.


Explicit Dropping Notification in Data Centers

Qingkai Meng (Beihang University, China); Yiran Zhang (Beijing University of Posts and Telecommunications, China); Chaolei Hu, Bo Wang and Fengyuan Ren (Tsinghua University, China)

Datacenter applications increasingly demand microsecond-scale latency and tight tail latency. Despite recent advances in datacenter transport protocols, we observe that timeouts caused by packet loss are the killer of microsecond-scale latency. Moreover, refining the RTO setting is impractical due to significant fluctuations in RTT. In this paper, we propose explicit dropping notification (EDN) to avoid timeouts. EDN rekindles ICMP Source Quench: the switch notifies the source with precise packet loss information, so the source can rapidly pinpoint dropped packets for fast retransmission instead of waiting for timeouts. More importantly, fast retransmission does not mean immediate retransmission, which is prone to aggravating congestion and deteriorating latency. In light of this, we finesse the timing and sending rate of retransmission. Specifically, as a reward of the paradigm shift to explicit notification, the source can pause for the queue draining time piggybacked on EDN messages and estimate the connection capacity to determine a proper sending rate, thus avoiding congestion aggravation. We implement EDN on a P4-programmable switching ASIC and in the Linux kernel. Evaluations show that, compared with state-of-the-art loss recovery schemes, EDN reduces latency by up to 4.1\(\times\) on average and 3.6\(\times\) at the 99th percentile.
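The receiver-side reaction described above (pause for the piggybacked queue draining time, then retransmit at a rate derived from the estimated connection capacity) can be sketched as an event handler. The notification field names, units, and the pacing rule are hypothetical, not EDN's wire format.

```python
import time

def handle_edn(notification, retransmit):
    """React to an explicit dropping notification: wait for the switch
    queue to drain, then retransmit the reported packets paced at the
    estimated connection capacity. `notification` is a dict with
    illustrative fields: drain_time (s), capacity (bytes/s),
    pkt_size (bytes), lost_seqs (sequence numbers to resend)."""
    time.sleep(notification["drain_time"])        # let the switch queue drain
    interval = notification["pkt_size"] / notification["capacity"]
    for seq in notification["lost_seqs"]:
        retransmit(seq)                            # resend without waiting for RTO
        time.sleep(interval)                       # pace to avoid re-congesting

# Toy usage: collect the paced retransmissions.
sent = []
handle_edn({"drain_time": 0.0, "capacity": 1e9, "pkt_size": 1500,
            "lost_seqs": [5, 6, 7]}, sent.append)
assert sent == [5, 6, 7]
```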
Speaker Qingkai Meng (Beihang University)



Session Chair

Chunyi Peng (Purdue University, USA)


